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Abstract 

A functional approximation to implement Bayesian source separation analysis is intro- 
duced and applied to separation of the Cosmic Microwave Background (CMB) using WMAP 
data. The approximation allows for tractable full-sky map reconstructions at the scale of 
both WMAP and Planck data and models the spatial smoothness of sources through a 
Gaussian Markov random field prior. It is orders of magnitude faster than the usual MCMC 
approaches. The performance and limitations of the approximation are also discussed. 

1 Introduction 

Source separation is one of the initial data processing tasks for multi-channel image data, such as 
have been obtained at microwave frequencies by COBE, WMAP and more recently Planck. The 
goal in this case, and the application of focus for this paper, is to reconstruct the CMB signal by 
separating it from other sources. Additionally, maps of the other sources may be obtained and 
of interest. 

In this paper we propose a new method of implementing Bayesian inference to source sep- 
aration, based on a discrete grid approximation to a posterior density, and apply it to CMB 
data. The method is substantially faster than the usual sampling-based approaches to Bayesian 
inference, allows for full-sky source reconstructions of data of the size of WMAP (w 3 x 10 6 pixels 
at 5 channels) in practical amounts of time, and should remain feasible for data that is an order 
of magnitude larger, at the higher resolution of Planck. Further, the approach permits spatially 
smooth priors to be specified for the sources through a Gaussian Markov random field. 

Bayesian source separation computes the posterior distribution of source components given 
data. Several approaches based on factor analysis or independent components analysis (ICA) 
models have been proposed e.g. Hobson et al. (1998); Erikscn et al. (2006); Wilson et al. (2008); 
Kuruoglu (2010). The advantages of the Bayesian approach are the ease with which domain 
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knowledge can be used in the analysis through the specification of the prior distribution, and 
the coherent treatment of uncertainty which leads to proper estimation of the uncertainties in 
the source components from the uncertainties in the model and data. The former is particularly 
useful in this context as so much is known about the sources and how they contribute to the 
data maps at different channels; the incorporation of such information can greatly improve the 
separation (Wilson et al., 2008). 

The principal disadvantage of Bayesian methods is computational. The usual approach to 
computing the posterior distribution is through Markov chain Monte Carlo (MCMC) sampling 
(Gelman et al., 2003, Chapter 11). For the model considered here, the computational and storage 
requirements of an MCMC solution make it impractical to consider separation at the scale of 
complete maps of WMAP data, as well as implementation of standard statistical diagnostics 
for model assessment like cross validation, even with partitioning the data into smaller regions. 
MCMC can also suffer from problems of slow convergence and exploration of the support of the 
posterior distribution. Nevertheless there has been some progress in MCMC methods; Kayabol 
et al. (2009) implemented a Metropolis algorithm for a non-Gaussian Markov random field prior 
on the sources, which in Kayabol et al. (2009) is speeded up considerably by the use of a Langevin 
sampler. 

Functional approximations to high-dimensional posterior distributions, rather than sampling- 
based approximations, are an alternative that have gained some prominence in the last 10 years 
or so. They can be substantially faster than MCMC. Variational Bayes (VB) is one approxima- 
tion that has seen some application to source separation (Winther and Petersen, 2007; Cemgil 
et al., 2007), where the idea is to find an approximating distribution to the posterior that is 
close in the sense of Kullback-Leibler divergence (Jordan, 1998). VB relies on factorising the 
approximating distribution for a tractable algorithm, which tends to lead to under-estimation of 
posterior variances, although means are in general approximated well (Wang and Titterington, 
2004). 

This paper demonstrates that in fact a relatively unsophisticated discrete approximation is 
sufficient in the case of a Gaussian likelihood and a Gaussian Markov random field prior for the 
sources, as long as the number of hyperparameters in the model is not too large. Later we discuss 
how these assumptions can be relaxed by generalising the approximation to the integrated nested 
Laplace approximation (INLA) of Rue et al. (2008). Although our approach is restricted to a 
much smaller class of models than VB, INLA has been shown to be both fast and accurate within 
this class. 

Section 2 describes the factor analysis model and Section 3 discusses prior specification for 
the Bayesian inference. Section 4 describes the approximation that allows computation of an 
approximation to the posterior mean of the sources, which is then illustrated in Section 6 by 
analysis of the 7-year WMAP data into 4 sources. 
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2 Model 



The data consist of images of intensities at n/ frequencies v\, ■ ■ ■ ,v nf over the sky at J pixels. 
The data at pixel j are denoted yj G R nf , j = 1, 2, • • • , J, while Y k — (yik, ■ ■ ■ , yjk) T denotes 
the all-sky image at frequency v k . There are n s sources. The vector of source components at 
pixel j is denoted Sj G R" s and the image of source i is Si = (su, . . . , sj,) T . 

We assume the standard statistical independent components analysis model for yy. 

yj = Asj + ej, j = 1,..., J, (1) 

where A is an n/ x n s mixing matrix and ej is a vector of nf independent Gaussian error terms 
with precisions r = (n , • • • , r n/ ) . 

Stacking the Y fe and Si as Y = (Y? , . . . , F„^) T and S = (Sj , . . . , S^J T , and stacking the 
error terms by frequency E — (en, . . . , ej\, e\2, ■ ■ ■ , ej nf ), Eq. 1 can be rewritten: 

Y = BS + E, (2) 

where B = A ® Ij x j is the Kronecker product of A with the J x J identity matrix. 

It is common for each pixel to be observed more than once, and the scanning schedule of 
the detector means that different pixels may be observed a different number of times. Define 
rij to be the number of times that pixel j is observed. Where this occurs, the Gaussian error 
assumption implies that the probability distribution for the rij observations of yjk is equivalent 
to a single observation that is the average of the observations with precision ejk — njTk; like this 
we consider each yj k to be observed once and E is zero-mean Gaussian with precision matrix 

C = diag(n 1 r 1 , . . . , njn, n 1 T 7lf njT nf ). 

Uniqueness of the solution for A and S is forced by setting a row of A (the fourth row here) to 
be ones. 

Four sources are assumed in this work: CMB, synchrotron, galactic dust and free-free emis- 
sion. A parameterisation of A is assumed, following Eriksen et al. (2006). The first column of 
A is the contribution of CMB and is assumed known (black body) : 

Wk \ 2 cxp(nis k /k B T ) 
k B To) (exp(nv k /k B T ) - l) 2 ' 

T = 2.725K is the average CMB temperature, r\ is the Planck constant and k B is Boltzmann's 
constant. It has no free parameter. The second column is for synchrotron radiation and has 
entries of the form 

Ak2 



Aki = g{vk)/g("4,), k = 1, • • • ,n f , where g{v k ) 



v k ' 



for a free parameter 9 S , the third column is for galactic dust and has entries of the form 

A ^ = exp^VfcgTl) - 1 ' ! 

3 cxv{riv k /k B Ti) - 1 \v± 
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for a free parameter 9d, where T\ = 18.1K , and the fourth column is for free-free emission and 
has entries of the form 

and has no free parameter. Hence A is parameterised by 9 S and 9d- 



3 Prior 

Prior for 6 s ,0d- Prior studies give ranges for the mixing matrix parameters: —3.0 < 9 S < —2.3 
and 1.0 < 9d < 2.0 (Eriksen et al., 2006). This information is quantified as independent uniform 
distributions on these ranges. 

Prior for the sources: Independent intrinsic Gaussian Markov random field (GMRF) priors 
are used for each source Si. These priors impose spatial smoothness on Si by inducing conditional 
independence of a pixel on the others given its neighbours. In this paper the first order intrinsic 
GMRF (Rue and Held, 2005, Chapter 3) is used, which imposes that the differences 

are independent zero-mean Gaussian with precision where c(j) is the set of pixel indices of 
the four nearest neighbours of pixel j. This leads to a distribution of Si that is of zero-mean 
multivariate Gaussian form: 

p(Si | <j>i) cx |Q(^)| - 5 exp (-0.5Sf Si) , (3) 

where Q(4>i) is a J x J matrix that can be written as Q(<f>i) — 4>iD T D, where the elements of 
D are defined as: 

i, if h e c(ji) 

0, otherwise, 



for ji ^ j 2 and main diagonal elements are Djj = — ^i=i The term intrinsic GMRF comes 
from the fact that Q{4>i) is not of full rank, hence Equation 3 is not a well-defined probability 
density function. However, the posterior distributions of the Si will still be properly defined; 
again, see Chapter 3 of Rue and Held (2005). 

Let * = (9d, 6 S , (f>i, . . . , (f> ns ) denote all the hyperparameters in the model. The distribution 
of the stacked vector of sources is then 

p(S | *) cx |Q(M>)| - 5 exp (-0.55 T Q(*) S) , (4) 
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where 

/ Q(<h) ••• \ 



o 0(02) ••• 



Prior for the </>j: Independent gamma distributions are used. The density function is of the 
formp(^) oc ^~ x e~ ai '^ i for positive hyperparameters a, and &j. Default non-informative values 
are bi = 1 and very small, otherwise prior knowledge about the degree of variation in each 
source can inform the choice using, for example, that the mean and standard deviation of this 
distribution are bi/di and \fbi/a,i respectively. 

Prior for the -77-: The Tk are assumed known. This is a reasonable assumption for microwave 
maps, based on data from detector calibration. 

4 Posterior Calculations 

The unknown quantities in this model are S and The posterior distribution is then: 

p(S,¥\Y) oc p(Y\S,¥)p(S\9)p(¥) 

= p(Y\S,9 d ,0 s ) (jlpiSilfr^J ( [ p{e s )p{6 d ) fjpO^; (5) 

all these terms are defined in Sections 2 and 3. 

The aim is to compute K(S\Y), the posterior expectation of the sources. For this, an 
approximation to the marginal posterior distribution of \I> is needed first. 

4.1 Discrete approximation of \ Y) 

Simple manipulation of the multiplicative law of probability shows that for any S such that 
p(S\Y,*)>0, 

The numerator terms of the right side of Eq. 6 are given in Eq. 5 and the denominator term is 
easily shown to be Gaussian: 

p(S \Y,9) = (2tt) - 5 "^ |Q* 0J>)|°- 5 cxp(-0.5(S - M * (*)) T Q* - /x*(*))), (7) 

where the precision and mean are 

Q*(¥) = Q(*)+B T CBand (8) 
= Q*(*)- 1 B T CY (9) 
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respectively. A numerically stable value of S at which to evaluate Eq. 6 is arg maxg p(S \ Y, \I>) = 
which gives the definition of a function q(ty \ Y): 

p(V\Y) oc |Q*(*)r°- 5 p(^|S' = / x*(*),*)p(S' = =g(*|F). 

Evaluation of ( \I> | y) requires the computation of the determinant and inverse of Q*(^) whose 
dimension (n s J x n s J) is prohibitively large. What is possible is to define qw over a smaller 
window W of pixels: 

qw(*\Y w ) = \Q* w (*)\-°- 5 P (Y W \S W = ti* w (*),*)p(S w = | 

where Y w and S^y are the elements of Y and 5 over the pixels in W. The matrix Q w (^f) 
is the precision matrix of Sw given Yw and * and follows Eq. 8 with Q(*), B and C re- 
placed by their submatrices Qw(^), Bw and Cw corresponding to the pixels in W; = 
QV(*)" lB w c ^Y lv follows Eq. 9 similarly. The size of the window W is chosen so that both 
\Q W (^)\ and Q* w {^y l can be computed. 

Now | Y\y) can be derived numerically by computing the proportionality constant 

'I q w (*\Y w )d*) 

to obtain it from qw- This is done by evaluating qw over a discrete set Q of values of '3/ . The 
set is defined by an initial exploration of qw to find a mode with respect to 'U', then exploring 
around that mode to find a high probability region. Here, the Hessian of log(<ftv(* I Y)) with 
respect to \I> is computed at the mode and Q formed by taking points out along each parameter 
axis, at intervals equal to the square root of the inverse of the Hessian, until log(g^(* | Y)) is 3 
less than its value at the mode. Rue et al. (2008) recommend exploring along the eigenvectors of 
the Hessian, which may be more efficient. The proportionality constant is approximated by the 
Ricmann sum over Q thus: 

where A* are volume weights. 

4.2 Approximate evaluation of E(S \ Y) 

The sources are reconstructed using the posterior means. In principal one wants to compute the 
E(Sij | Y) but this would require the evaluation of q(^ \ Y). Instead, the posterior means over 
W are computed via the conditional expectation formula and Eq. 10: 

E(Sw\Y w ) = Ev lYw (E(Sw\Yw,y)) 
»w(*)p(*\Y w )d* 
E*eQ»w(*)<lw(*\Yw) A* 



(ii) 
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Figure 1: Simulated example: the 3 sources. 



and these used as an approximation to E(Sij | Y) for any j e W. 

5 Simulated data example 

As an illustration, 3 sources were mixed into 6 components according to the matrix 

0.20 \ 
0.34 
0.63 

1.00 ' 
1.55 
2.51 / 

which corresponds to the mixing components of CMB, synchrotron and galactic dust respectively, 
as described in Section 2, at 30, 44, 70, 100, 143 and 217 GHz, with 9 S = -2.8 and d = 1.4, 
normalised so that the 4th row is made of ones. Figure 1 and 2 shows the ground truth and data 
respectively. The prior distributions for 9 S , 9 d and fa follow those described in 3. This problem 
is sufficiently small (n s = 3, J = 256) to be solved without blocking. The parameter vector <fr is 
of dimension 6 with the grid Q composed of about 50,000 points. Figure 3 shows the posterior 
means of the 3 sources calculated via Eq. 11. As a comparison with other common methods of 
source separation, in Figure 4 arc scatter plots of true versus estimated source pixel values via 
standard least squares and fast ICA, as well as Baycsian inference implemented by MCMC and 
the approach of this paper. We see that the Bayesian method produces the most accurate result 
for either implementation, but it is noted that the approach of this paper is substantially faster 
than MCMC. 

6 Analysis of 7 year WMAP data 

The seven year WMAP data was analysed using the procedure of Section 4. WMAP data consist 
of 5 images of J = 3 x 2 20 = 3, 145, 728 pixels (see Figure 5) which were divided into 6144 blocks 
of 512 pixels for the analysis. 
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Figure 2: Simulated example: the 6 observed images. 




8 



LS 



fastICA 



MCMC 




This paper ~™ ~™ "•* ~™ 

Source 1 Source 2 Source 3 

Figure 4: Scatter plots of observed versus posterior means for the 3 sources with 4 algorithms: 
least squares, fast ICA, Bayesian implemented by MCMC and Bayesian implemented by the 
approach of this paper. 
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Figure 5: The 7 year WMAP data. 



Separation into the 4 sources described in Section 2, following the method described in Sec- 
tions 3 and 4, was implemented. The parameter vector \& = (9<j,9 s ,<f>i, . . . ,4>±) has dimension 
6 and, for the computation of qw{^ I Y\v)i the grid Q was computed as described in Section 4 
which led to a grid size of at most 50,000 points and sometimes much smaller. On the same PC 
as was used for Section 5, the computation of each pw{^ I Yw) an d K(Sw | Y\v) through Eqs. 
10 and 11 took about 40 seconds with MATLAB code. On a single processor, this equates to 
about 72 hours to complete the full map. The most time-consuming operation was the Cholesky 
decomposition used to compute Q^r(\If) _1 . It is noted that processing of different blocks can 
be done in parallel, so there is great potential to reduce the total computation time if more 
processors are available. 

It has been noted that a successful separation can be achieved when the mean of the prior 
of the 4>i differs greatly but that these priors must have small variance, otherwise maps of the 
posterior expectations are not smooth and contain several large outlier pixel values. Figure 6 
shows the posterior expectation of CMB for 3 different priors on the <f>i with means of 1, 5 and 
10, corresponding to weak, medium and strong spatial smoothness. The effect of the prior on 
4>i is clearly seen in the resulting separation. Figure 7 shows the posterior means of the other 
separated sources where the prior mean of each <pi is 10 (strong spatial smoothness), and Figure 
8 shows histograms of the posterior means of the components of \P, or their logarithm, over the 
6144 blocks. Expectation of log parameter values are used in most cases for a clearer plot. It is 
seen that the prior on the <pi has a strong influence on the posterior mean; the expectations of 
log(0j) are remarkably consistent over the different blocks. 
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Figure 6: Posterior means of CMB for WMAP data from Eq. 11 with (from left to right) a prior 
mean for the spatial smoothness parameters </>, of 1, 5 and 10. 




Figure 7: Posterior means of (from left to right) synchrotron, galactic dust and free-free emission 
for WMAP data from Eq. 11 with a prior mean for the spatial smoothness parameters <f>i of 10. 

7 Discussion and Conclusion 

This paper has outlined a relatively straightforward method of approximating the posterior means 
of sources in a Gaussian source separation problem. By dividing the data into blocks, it can be 
used to conduct source separation in a reasonable time for data of the scale of WMAP and 
Planck. The blocking allows, particularly if parallel computation is available, an algorithm that 
is orders of magnitude faster than an MCMC approach. 

From the results in Figures 6 and 7, the most obvious feature is that the galactic plane still 
causes considerable difficulties. So while this is a completely automatic algorithm, it will still 
require manual processing about the galactic plane. It is also noted that there does not appear 
to be an obvious block effect except near the galactic plane; a smooth reconstruction is in general 
obtained. Block effects can be smoothed out in various ways, such as taking a moving average 
or averaging over overlapping blocks. 

While the use of blocks is a necessary approximation, it is noted that a further restriction 
on the method is that the dimension of \& — the vector of mixing matrix and prior source 
parameters — must be small enough to allow a discrete grid to be stored and qw computed on 
it in a reasonable time. It has been shown that this is tractable for 4 sources, with 6 parameters. 
The addition of an extra source adds 2 parameters to \& — one for its mixing matrix column 
and one for the IGMRF prior — so that the existing method becomes intractable for a much 
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Figure 8: Histograms of the posterior means of the elements of SJ r , or their logarithm, over the 
6144 blocks for the case where the prior mean of the 4>i is 10. Average of the expectations are 
shown above each plot. 
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larger number of sources. For example, for 6 sources one would have 10 components in which 
would allow little more than a grid of 3 points along each dimension (3 10 points in Q). For more 
sources, one option is to make a further approximation by forcing independence between the two 
set of components in the mixing matrix parameters 0i and the IGMRF parameters fa: 

p(*\Y) = P (9\Y) P (<t>\Y), 

and then compute independently p(0 \ Y) and p(4> \ Y) on separate grids of lower dimension. 
This would allow implementation of the algorithm to 6 sources comfortably. 

Another observation is that the serial computation time is inversely related to the block size. 
Suppose there are K blocks. The dominant computation is the Cholesky decomposition of the 
matrices Q^(\I>), which are of dimension n s J/K x n s J/K. Computation of this decomposition 
is of order at worst (n s J/K) 3 , and at best (n s J/K) 2 if can be written as a band matrix, 

and there are K of them, so in terms of K the total computation time is of order 1/K to 1/K 2 . 
So using smaller blocks is quicker, but this is clearly at the expense of an accurate approximation 
to E(S | Y). The block size of 512 pixels used here is a compromise with the longest computation 
time that we can justify. For example, when using 24,576 blocks of 128 pixels, the computation 
time per block is about 0.75 seconds which gives a total serial computation time of about 5 hours. 
This compares with a total time of 72 hours for blocks of 512 pixels. 

It is also noted that the identity used in Eq. 6 can be used in cases where the likelihood 
p(Y | S, is not Gaussian. In this case a Gaussian approximation to the denominator term 
p(S | Y, \I/) can be found by equating a mean and precision to its mode and curvature at the 
mode. The rest of the method of computing E(SV I Yw) is identical. This is the integrated 
nested Laplace approximation of Rue et al. (2008) and has been shown to be very accurate in 
a wide range of latent Gaussian models. This would allow, for example, the use of non-linear 
relationships between Y and S, or non-Gaussian measurement error. 
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